A Deep Dive into Course Descriptions: Using Quanteda to Identify Work-Based Learning Opportunities

Data@Urban Draft

Authors

Manuel Alcalá Kovalski and Judah Axelrod

Published

June 12, 2023

This blog post is part two of a series on analyzing work-based learning opportunities in community colleges. In part one, we discussed how we used web scraping to gather course descriptions from community colleges in Florida. Now, we’ll delve into how we analyzed this data using the quanteda package in R.

1 Introduction

In our previous post, we detailed our journey of collecting course descriptions from Florida’s community colleges using web scraping techniques. We successfully compiled a comprehensive dataset, but the question remained: how do we make sense of this vast amount of text data? Enter quanteda, an R package designed for quantitative text analysis.

2 Getting Started with Quanteda

Quanteda, short for Quantitative Analysis of Textual Data, is a powerful tool for managing and analyzing text data in R. It offers a suite of functions for corpus management, creating document-feature matrices, analyzing keywords, and more. These functions are highly efficient and provide a consistent interface with support for multiple languages. While it operates as a standalone package, it also integrates seamlessly with extensions such as readtext, spacyr, and quanteda.textstats.

To install and load the required packages, we can use the librarian package for convenience:

Code
librarian::shelf(tidyverse, # data wrangling
                 quanteda, # text mining
                 readtext # read texts and associated document-level meta-data
)

Next, we load our text data containing course descriptions as well as document-level meta-data using the readtext::readtext() function. We then perform some additional cleaning steps to standardize variable names and focus our analysis on active courses for credit.

Code
courses <-
  readtext(here::here("data/data-intermediate/course-descriptions.csv"), 
           text_field = "course_description",
           docid_field = "course_id") %>% 
  janitor::clean_names() %>% 
  filter(course_status == "ACTIVE",
         type_of_credit == "COLLEGE CREDIT")

The first step in our analysis is to create a corpus, a collection of text documents, from our course descriptions. We can achieve this using the corpus() function in quanteda. Next, we extract the tokens in the corpus (usually words, but they can also be n-grams or multiword expressions). The tokens() function lets us define what counts as a token and apply rules to ignore elements such as punctuation and digits.

Code
corpus <- corpus(courses)
Corpus consisting of 5 documents and 29 docvars.
BC-THE-2300 :
"A STUDY OF DRAMATIC LITERATURE FROM THE TIME OF THE EARLY GR..."

BC-JST-1500 :
"A SURVEY OF JEWISH CULTURE (JST1500) IS AN EXAMINATION OF JE..."

BC-LEI-1700 :
"AN OVERVIEW OF THE CHARACTERISTICS AND NEEDS OF MEMBERS OF S..."

BC-JOU-2200 :
"COURSE PROVIDES INSTRUCTION AND PRACTICAL EXPERIENCE IN COPY..."

BC-FRE-1121 :
"CONTINUATION OF FRE 1120. FURTHER DEVELOPMENT OF THE BASIC S..."
Code
tk <- tokens(corpus, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
Tokens consisting of 5 documents and 29 docvars.
BC-THE-2300 :
 [1] "A"          "STUDY"      "OF"         "DRAMATIC"   "LITERATURE"
 [6] "FROM"       "THE"        "TIME"       "OF"         "THE"       
[11] "EARLY"      "GREEKS"    
[ ... and 56 more ]

BC-JST-1500 :
 [1] "A"           "SURVEY"      "OF"          "JEWISH"      "CULTURE"    
 [6] "JST1500"     "IS"          "AN"          "EXAMINATION" "OF"         
[11] "JEWISH"      "THOUGHT"    
[ ... and 18 more ]

BC-LEI-1700 :
 [1] "AN"              "OVERVIEW"        "OF"              "THE"            
 [5] "CHARACTERISTICS" "AND"             "NEEDS"           "OF"             
 [9] "MEMBERS"         "OF"              "SPECIAL"         "GROUPS"         
[ ... and 12 more ]

BC-JOU-2200 :
 [1] "COURSE"      "PROVIDES"    "INSTRUCTION" "AND"         "PRACTICAL"  
 [6] "EXPERIENCE"  "IN"          "COPY"        "EDITING"     "REWRITING"  
[11] "HEADLINE"    "WRITING"    
[ ... and 19 more ]

BC-FRE-1121 :
 [1] "CONTINUATION" "OF"           "FRE"          "FURTHER"      "DEVELOPMENT" 
 [6] "OF"           "THE"          "BASIC"        "SKILLS"       "IN"          
[11] "SPEAKING"     "LISTENING"   
[ ... and 51 more ]
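Although our keyword search works directly on the tokens, the same tokens object can also feed a document-feature matrix, the workhorse data structure for most quanteda analyses. As a quick illustration (a sketch only, using the tk object from above; not part of our main pipeline):

```r
# Illustration: build a document-feature matrix from our tokens and
# inspect the most common features after dropping English stop words
dfmat <- tokens_remove(tk, stopwords("en")) %>%
  dfm()

# Ten most frequent features across all course descriptions
topfeatures(dfmat, 10)
```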

3 Key-Term Searches with a Dictionary

Our primary interest lies in identifying courses related to different types of work-based learning (WBL), such as internships, apprenticeships, or practicums. For each type of work-based learning experience, we create a list of terms that we want to treat as equivalent. For instance, our dictionary can specify that a course description refers to clinical WBL if either of the terms “clinicals” or “clinical experience” appears.

Code
dict <-
  dictionary(list(apprenticeship = "apprentice*",
                  practicum = c("practicum", "practica"),
                  coop = c("co-op", "cooperative education", "co-operative education"),
                  clinicals = c("clinicals", "clinical experience"),
                  on_the_job = c("on the job training", "job training", "on-the-job training"),
                  wbl = c("work-based learning", "work based learning", "wbl"),
                  real_world_experience = c("real-world experience", "real world experience"),
                  service_learning = "service learning",
                  field_experience = c("fieldwork", "field experience", "field-experience")))

Armed with our dictionary, we’re ready to search for these terms using the kwic() (keyword-in-context) function. It takes our tokens object and the dictionary as inputs, along with a window argument specifying the number of tokens before and after each match that we want to keep for context.

Code
keywords <- kwic(tk, dict, window = 10) %>% as_tibble()

Finally, we can join the results of the dictionary-based keyword-in-context search back to the course-level data and perform some wrangling to analyze the prevalence of work-based learning opportunities in Florida’s community colleges.

Code
courses_with_keywords <-
  keywords %>% 
  left_join(courses, by = c("docname" = "doc_id")) %>%
  mutate(sentence = glue::glue("{pre} {keyword} {post}") %>% str_trim() %>% str_to_sentence()) %>% 
  separate(docname, into = c("school", "course"), sep = "-", extra = "merge") %>% 
  mutate(discipline = str_remove(discipline, ".* - ")) %>% 
  select(school, discipline, course, statewide_course, degree_type, course_credits, sentence, pattern) %>%
  arrange(school, pattern, discipline) %>% 
  mutate(pattern = str_to_title(pattern) %>% str_replace_all("_", "-"))
Code
freq_table <-
  courses_with_keywords %>%
  group_by(school, pattern) %>%
  count() %>%
  group_by(pattern) %>%
  bind_rows(summarise(.,
                      across(where(is.numeric), sum),
                      across(where(is.character), ~ "total"))) %>% 
  ungroup() %>%
  pivot_wider(names_from = pattern, values_from = n) %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>% 
  slice(n(), 1:(n() - 1))
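As an aside, quanteda can produce a similar tally without leaving the package: tokens_lookup() replaces matching tokens with their dictionary keys, and a document-feature matrix of the result counts matches per document. A minimal sketch, assuming the tk and dict objects defined above:

```r
# Sketch: count dictionary-key matches in each course description.
# tokens_lookup() maps matching tokens (including multiword patterns)
# to their dictionary keys; dfm() then tallies keys per document.
wbl_matches <- tokens_lookup(tk, dictionary = dict) %>%
  dfm()
```

Aggregating the rows of wbl_matches by school would yield counts comparable to freq_table, though we prefer the tidyverse route above because it keeps the surrounding context of each match for inspection.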

4 Conclusion

In this post, we’ve demonstrated how to use quanteda to analyze course descriptions and identify work-based learning opportunities in community colleges. While our analysis focused on Florida, the same methods could be applied to other states or regions.

Through this analysis, we’ve gained valuable insights into the prevalence of work-based learning in Florida’s community colleges. We hope that our work can serve as a foundation for further research and policy discussions on this important topic.

The power of quanteda lies not only in its ability to handle large volumes of text data but also in its flexibility. It allows researchers to tailor their analysis to their specific needs, whether that’s identifying key terms, comparing text documents, or exploring patterns in text.

This blog post was co-authored by Manuel Alcalá Kovalski and Judah Axelrod.